(54) Method of using primary and secondary processors

(57) The invention relates to the compilation of source code to a primary and a secondary processor. It relates to reconfigurable secondary processors, and is especially relevant to secondary processors which can be reconfigured to some degree during execution of code. Selective extraction of dataflows from the source code is followed by transformation of the extracted dataflows into trees. The trees are then matched against each other to determine minimum edit cost relationships for transformation of one tree into another, where these minimum edit cost relationships are determined by the architecture of the secondary processor. A group or a plurality of groups of dataflows is determined on the basis of said minimum edit cost relationships, and for each group a generic dataflow capable of supporting each dataflow in that group is created. The generic dataflow or dataflows is then used to determine the hardware configuration of the secondary processor, and calls to the secondary processor for said group or plurality of groups of dataflows are substituted into the source code. The resultant source code is compiled to the primary processor.

The resulting efficient configuration thus reduces either the expense of reconfiguration (in a field programmable array), or the silicon area (in an application specific integrated circuit).
Description

[0001] The present invention relates to the [...] consisting of a primary processor and one (or more) secondary processors. The invention is particularly, though not exclusively, relevant to architectures employing [...]

[0002] A primary processor - such as a Pentium processor in a conventional PC (Pentium is a Trade Mark of Intel Corporation) - has evolved to be versatile, in that it is adapted to handle a wide range of computational tasks without being optimised for any of them. Such a processor is thus not optimised [...] operations, such as parallel sub-[...] tasks. Such tasks can [...]

[0003] An approach taken to serve the [...] applications. These are known as ASICs, or application-specific integrated circuits. Tasks for which such an ASIC is adapted are generally performed very well: however, the ASIC will generally perform poorly, if at all, on tasks for which it is not configured. Clearly, a specific IC can be built for a particular application, but this is not a desirable solution for applications that are not central to the operation of a computer, or are not yet determined at the time of building the computer. It is thus particularly advantageous for an ASIC to be reconfigurable, so that it can be optimised for different applications as required. The commonest form of architecture for such devices is the field programmable gate array (FPGA), a fine-grained processor structure which can be configured to have a structure which is suited to any given application. Such structures can be used as independent processors in suitable contexts, but are also particularly appropriate to use as coprocessors.

[0004] Such configurable coprocessors have the potential to improve the performance [...] particular tasks, code run inefficiently by the primary processor can be extracted and run more efficiently in an adapted coprocessor which has been optimised for that application. With continued development of such "application-specific" secondary processors, the possibility of improving performance by extracting difficult code to a custom coprocessor becomes more attractive. A particularly important example in general computing is the extraction of loop bodies in image handling.

[0005] To obtain the desired efficiency gains, it is necessary to determine as effect[...] divided between primary and secondary processors, and to configure the s[...] its assigned part of the code. One approach is to mark the code appropriately on its creation for mapping to coprocessor structures. In "A C++ compiler for FPGA custom execution units synthesis", Christian Iseli and Eduardo Sanchez, IEEE Symposium on FPGAs for Custom Computing Machines, Napa, California, April 1995, an approach is employed which involves mapping of C++ to FPGAs in VLIW (Very-Long Instruction Word) structures after appropriate tagging of the initial code by the programmer. This approach relies on the initial programmer making a good choice of code to extract initially.

[0006] An alternative approach is to assess the initial code to determine which the most appropriate elements to direct to the secondary processor will be. "Two-Level Hardware/Software Partitioning Using CoDe-X", Reiner W. Hartenstein, Jürgen Becker and Rainer Kress, in Int. IEEE Symposium on Engineering of Computer Based Systems (ECBS), Friedrichshafen, Germany, March 1996, discusses a codesign tool which incorporates a profiler to assess which parts of an initial code are suitable for allocation to a coprocessor and which should be reserved for the primary processor. This is followed by an iterative procedure allowing for compilation of a subset of C code to a reconfigurable coprocessor architecture so that the extracted code can be mapped to the coprocessor. This approach does expand the usage of secondary processors, but does not fully realize the potential of reconfigurable logic.

[0007] Comparable approaches have been proposed in the BRASS research project at the University of Berkeley. An approach discussed in "Datapath-Oriented FPGA Mapping and Placement", Tim Callahan & John Wawrzynek, a poster presented at FCCM'97, Symposium on Field-Programmable Custom Computing Machines, April 16-18 1997, Napa Valley, California (currently available on the World Wide Web at http://www.cs.berkeley.edu/[...]) uses template structures representative of an FPGA architecture to assist in the mapping of source code on to FPGA structures. Source code samples are rendered as directed acyclic graphs, or DAGs, and then reduced to trees. These and other basic graph concepts are set out, for example, in "High Performance Compilers for Parallel Computing", Michael Wolfe, pages 49 to 56, Addison-Wesley, Redwood City, 1996, but a brief definition of a DAG and a tree follows here.

[0008] A graph consists of a set of nodes, and a set of edges: each edge is defined by a pair of nodes (and can be considered graphically as a line joining those nodes). A graph can be either directed or undirected: in a directed graph, each edge has a direction. If it is possible to define a path within a graph from one node back to itself, then the graph is cyclic: if not, then the graph is acyclic. A DAG is a graph that is both directed and acyclic: it is thus a hierarchical structure. A tree is a specific kind of DAG. A tree has a single source node, termed "root", and there is a unique path from root to every other node in the tree. If there is an edge X→Y in a tree, then node X is termed the parent of Y, and Y is termed the child of X. In a tree, a "parent node" has [...] parent, whereas in a general DAG, a child can have more than one parent. Nodes of a tree [...]






EP 0 926 594 A1



nodes.

[0009] In the work of Tim Callahan & John Wawrzynek these trees [...] "tree covering" program called Iburg. Iburg is a generally available software tool [...] "A Retargetable C Compiler: Design and Implementation", Christopher W. Fraser and David R. Hanson, Benjamin/Cummings Publishing Co., Inc., Redwood City, 1995, especially at pp 373-407. Iburg takes as input the source code trees and partitions this input into chunks that correspond to instructions on the target processor. This partition is termed a tree cover. This approach is essentially determined by the user-defined patterns allowable for a chunk, and is relatively complex: it involves a bottom-up matching of a tree with patterns, recording all possible matches, followed by a top-down reduction pass to determine which match of patterns provides the lowest cost. Again, this approach requires a significant initial constraint in the form of the predefined set of allowable patterns, and does not fully realise the possibilities of a reconfigurable architecture.

[0010] There is thus a need to develop techniques and approaches to further improve [...] systems involving a primary and secondary processor [...] secondary processor, which can then be configured as efficiently as possible to run the extracted code, with a view to maximising the performance efficiency of the primary and secondary processor system in execution of input code.
[0011] Accordingly, the invention provides [...] source code to a primary and a secondary processor, comprising: selective extraction of dataflows from the source code; transformation of the extracted dataflows into trees; matching of the trees against each other to determine minimum edit cost relationships for transformation of one tree into another; determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost relationships and creating for each group a generic dataflow capable of supporting each dataflow in that group; using the generic dataflow or dataflows to determine the hardware configuration of the secondary processor; and substituting into the source code calls to the secondary processor for said group or plurality of groups of dataflows, and compiling the resultant source code to the primary processor.
[0012] This approach allows for optimal selection of source [...] without prejudgement of suitability (by, for example, mapping onto predetermined templates) but while still taking full account of the demands and requirements of the secondary processor architecture. Advantageously, said minimum edit cost relationships are determined according to the architecture of the secondary processor, and represent a hardware cost of a corresponding reconfiguration of the secondary processor. The method is particularly effective if the minimum edit cost relationships are embodied in a taxonomy of minimum edit distances for classification of the trees.

[0013] The method finds its most useful application where the hardware configuration of the secondary processor allows for reconfiguration of the secondary processor during execution of the source code, as this allows for reconfiguration of the secondary processor to be required during execution of the source code to support each dataflow in the group supported by a generic dataflow. The secondary processor may thus be [...] processor, and the processor hardware may be a field programmable gate array or a field programmable arithmetic array (such as that shown in the CHESS architecture discussed in Appendix A).

[0014] Advantageously, the generic dataflow of a group is created by an [...] group on to each other, followed by a merge operation.

[0015] An advantageous approach to construction of a generic dataflow is to provide the dataflows as directed acyclical graphs and reduce them to trees by removal of any links in the directed acyclical graphs not present in a critical path between a leaf node and the root of a directed acyclical graph, wherein a critical path is a path between two nodes which passes through the largest number of intermediate nodes. Alternative criteria to the critical path can be adopted if more appropriate to the secondary processor hardware ([...] sensitive to the timing of operations in the secondary processor).

[0016] An advantageous further step can be taken after the creation of a generic dataflow, in which the generic dataflow is compared with further dataflows extracted from the source code, wherein those of said further dataflows which match sufficiently closely the generic dataflow are added to the generic dataflow. This enables more or all of the code present in the source code which is suitable for allocation to the secondary processor to be so allocated.
[0017] In the approaches indicated above, the removed links are stored after the directed acyclical graphs are reduced to trees and are reinserted into the generic dataflow after the merging of the trees of the group into the generic dataflow.

[0018] Specific embodiments of the invention are described below, by way of example, with reference to the accompanying drawings, of which:

Figure 1 shows a general purpose computer architecture to which embodiments of the invention can suitably be applied;

Figure 2 shows schematically a method of [...] according to an embodiment of the invention;

Figure 3 illustrates a step of conversion of a DAG to a tree employed in a method step according to one embodiment of the invention;

Figure 4a illustrates the step of insertion and deletion of nodes and Figure 4b illustrates the step of substitution of nodes in a tree matching process employed in a method step according to [...];

Figure 5 shows an edit distance taxonomy provided in an example according to an embodiment of the invention;

Figure 6 illustrates a generic dataflow [...];

Figure 7 shows a logical interface for allocation of secondary processor resources for a generic dataflow according to an embodiment of the invention;

Figure 8 shows the application of DAGs to dataflows including multiplexers to handle conditional statements; and

Figures 9a to 9d show an illustration of the merging of candidate dataflows to form a generic dataflow in an example according to an embodiment [...]

[0019] The present invention is adapted for compilation of source code to an architecture comprising a primary and a secondary processor. An example of such an architecture [...] conventional general-purpose processor, such as a Pentium II processor of a personal computer. Receiving calls from the primary processor 1 and returning responses to it are secondary processors 2 and (optionally) 4. Each secondary processor 2, 4 is adapted to increase the computational power and efficiency of the architecture by handling parts of the source code not well handled by the primary processor 1. Secondary processor 4, optionally present here, is a dedicated coprocessor adapted to handle a specific function (such as JPEG, DSP or the like) - the structure of this coprocessor 4 will be determined by a manufacturer to handle a specific frequently used function. Such coprocessors 4 are not the specific subject of the present application. By contrast, the secondary processor 2 is not already optimised for a specific function, but is instead configurable to enable improved handling of parts of the source code not well handled by the primary processor. The secondary processor 2 is advantageously an application specific structure: it can be a conventional FPGA, such as the Xilinx 4013 or any other member of the Xilinx 4000 series. An alternative class of reconfigurable device, referred to as a field programmable arithmetic array, is described in Appendix A hereto. Such a secondary processor can be configured for high computational efficiency in handling desired parts of the source code for an application to be executed by the architecture.

[0020] Also employed in the computer architecture are memory 3, accessed by the primary processor 1 and, for appropriate types of secondary processor 2, by the secondary processor 2, and input/output channel 5. Input/output channel 5 here represents all further channels and hardware necessary to enable the user to interact with the processors (for example, by programming) and to allow the processors to interact with all other parts of the computer device 6.
[0021] The present invention is particularly relevant to the optimised partitioning of source code between primary processor 1 and secondary processor 2, which allows for optimal configuration of secondary processor 2 to optimise the handling of the application embodied in the source code by the architecture. A significant contribution is made by the invention in the selection and extraction of code for use in the secondary processor.

[0022] The approach taken, according to an embodiment of the invention, is set out in Figure 2. The initial input to the process is a body of source code. In principle, this can be in any language: the example described was carried out on C code, but the person skilled in the art will readily understand how the techniques described could be adopted with other languages. For example, the source code could be Java byte code: if Java byte code could be so handled, the architecture of Figure 1 could be particularly well adapted to directly receiving and executing source code received from the internet.

[0023] As can be seen from Figure 2, the first step in the process is the identification of appropriate candidate code to be executed by the secondary processor 2. Typically, this is done by performing dataflow analysis on the source code and building appropriate representations of the dataflows presented by selected lines of code (in most processes, this is normally preceded by a manual profiling of the code). This is a standard technique in compiling generally, and application to secondary processors is discussed in, for example, Athanas et al, "An Adaptive Hardware Machine Architecture and Compiler for Dynamic Processor Reconfiguration", IEEE International Conference on Computer Design, 1991, pages 397-400.

[0024] The approach taken here is to build directed acyclical graphs (DAGs) which represent the dataflows of selected code. An advantageous way to do this is by using a compiler infrastructure appropriately configured for the extraction of dataflows: an appropriate compiler infrastructure is SUIF, developed by the University of Stanford and documented extensively at the World Wide Web site http://suif.stanford.edu/ and elsewhere. SUIF is devised for compiler research for high-performance systems, specifically including systems comprising more than one processor. A standard SUIF utility can be used to convert C code to SUIF. It is then a simple process for one skilled in the art to use SUIF tools to build DAGs by performing a dataflow analysis over sections of SUIF and then recording the results of the analysis.

[0025] The extraction of DAGs from source code is a conventional step. The next step in the process, as can be seen from Figure 2, is the conversion of these DAGs into trees. This step is a significant factor in making the optimal choice of code for execution by the secondary processor 2. DAGs are complex structures, and difficult to analyse in an effective manner. Reduction of DAGs to trees allows the aspects of the dataflows most important in determining their mapping to hardware to be retained, while simplifying the [...] significantly more effective.

[0026] Discussion of the reduction of DAGs to trees is made in "High Performance Compilers for Parallel Computing" (as cited above), especially at pages 56 to 60. Different terminology is used here from that used in the cited reference, but equivalent and comparable terms are indicated below. The type of trees constructed here are directly comparable to the "spanning trees" referred to in the cited reference.

[0027] The preferred approach followed in the reduction of DAGs to trees is the removal of links not in the critical path between leaf nodes and the root: this is illustrated in Figure 3. The critical path between nodes A and B is in a first embodiment of this reduction process defined as the one that touches th[...] to definition, acyclic distinct paths can be defined to meet this criterion. It is possible for there to be different paths between nodes that have the same maximum number of nodes [...] purpose of tree construction. While making an arbitrary selection between these paths is a valid approach, a key issue in mapping the source code successfully is scheduling, which depends on timing information: accordingly, where it is necessary to make a choice between alternative "critical paths" it is desirable to choose the one that would take the longest time (in terms of time taken to execute each of the operations represented by the nodes in the path). As is discussed further below, alternative approaches can be adopted which are based more directly on timing information. It is also desirable to adopt a consistent approach in making such choices - otherwise morphologically different trees can result from essentially similar DAGs.

[0028] The process taken in applying this first embodiment of the critical [...] leaf node, every possible path towards the root is chased: as the DAG is a directed graph, this is straightforward. As indicated above, for each leaf node the path with the greatest number of nodes is chosen, and if two paths are found to have the same number of nodes, a selection is made. This is the critical path for that leaf node. All other paths not selected are cut in their edge closest to the starting point. This cut edge is termed a minor link (equivalent to the term "cross-link" in the Wolfe reference). The tree consists of the assembly of critical paths, and contains no minor links. The minor links are stored separately. Minor links will be required when extracted source code is mapped to secondary processor 2, but are not used in determining which source code is to be mapped to the secondary processor.
[0029] It is of course possible to construct trees from DAGs without using the critical path criterion. Use of the critical path does provide particular advantages. In particular, removal as minor links of the cross-links not in the critical path will have little effect on scheduling, whereas if another approach was adopted removed cross-links may have a considerable influence on timing and hence [...] represents as best possible the critical features of the DAG in the context of mapping to hardware.
[0030] Figure 3 shows the application of the process described in the preceding paragraph. Source code extract 11 shows three lines under consideration for execution by secondary processor 2. DAG 12 shows these three lines of code represented as a directed acyclical graph, with root 126 (variable e) and leaf nodes 121, 129 and 130 as the inputs.

[0031] It is now a straightforward matter to assess each path from a given leaf node to the root, and to compare the number of nodes in each path. From node 129 (integer value 2), there is only one path, through nodes 122, 123, 124 and 125. This is then the critical path from leaf node 129 to root node 126, and will be present in the tree. From node 121 (in the present case the result of an earlier operation and designated c), there are two paths. The first path passes through nodes 122, 123, 124 and 125, whereas the second path passes through nodes 127, 128 and 125. The first path is the critical path, as it passes through more nodes: the second path can thus be cut, as is discussed below. The remaining leaf node 130 (variable b) also has two paths available: one passes through nodes 123, 124 and 125, whereas the other passes through nodes 127, 128 and 125. These are equivalent in terms of number of nodes and so either path can be chosen as the critical path: however, for reasons discussed above (timing and morphological consistency) it is desirable to operate under an appropriate set of further rules to make the best selection. Such further rules may, for example, be determined on the basis of the relevant hardware. Here, the second path is chosen.

[0032] The next step to take is to construct a tree 14 from the critical paths chosen from the DAG 12. This is done by cutting all non-critical paths in their edge closest to the starting point (that is, the edge closest to the starting point which is not also part of a critical path). The first non-critical path to consider is that from node 121 to root 126 through nodes 127, 128 and 125. This can be cut on the edge between nodes 121 and 127 - in the tree, this is represented by removal of edge 151 between nodes 141 (corresponding to 121) and 147 (corresponding to 127), which is stored separately as a minor link. The other non-critical path to consider is that from node 130 to root 126 through nodes 123, 124 and 125: this can be cut on the edge between nodes 130 and 123. Again, this cut edge is stored as a minor link.
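
The reduction walked through for Figure 3 can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in the embodiment: the edge list is inferred from the paths named in the text (the figure itself is not reproduced here), the function name `reduce_to_tree` is hypothetical, and the tie-break (take the last-listed of the longest paths) is chosen only so that the result matches the selection described for node 130.

```python
# Edges of the Figure 3 DAG, inferred from the paths listed in the text
# (an assumption). Each edge points towards the root.
EDGES = [
    (129, 122), (121, 122), (122, 123), (123, 124), (124, 125), (125, 126),
    (121, 127), (127, 128), (128, 125), (130, 123), (130, 127),
]
ROOT = 126


def reduce_to_tree(edges, root):
    """Return (tree_edges, minor_links) using the node-count critical path."""
    succ, targets = {}, set()
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)
        targets.add(dst)
    # Leaf nodes are the inputs: nodes that no edge points to.
    leaves = [node for node in succ if node not in targets]

    def paths_from(node):
        # Enumerate every path from `node` to the root.
        if node == root:
            return [[root]]
        return [[node] + rest
                for nxt in succ.get(node, [])
                for rest in paths_from(nxt)]

    critical, rejected = set(), []
    for leaf in leaves:
        paths = paths_from(leaf)
        longest = max(len(p) for p in paths)
        candidates = [p for p in paths if len(p) == longest]
        chosen = candidates[-1]  # hypothetical tie-break rule (assumption)
        critical.update(zip(chosen, chosen[1:]))
        rejected.extend(p for p in paths if p is not chosen)

    # Cut each rejected path at its first edge that is not on a critical path.
    minor = set()
    for path in rejected:
        for edge in zip(path, path[1:]):
            if edge not in critical:
                minor.add(edge)
                break
    return critical, minor


tree_edges, minor_links = reduce_to_tree(EDGES, ROOT)
print(sorted(minor_links))  # [(121, 127), (130, 123)]
```

The two minor links reported are the edges the text describes as cut; a hardware-aware tie-break would replace the `candidates[-1]` line.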
[0033] It should be noted that conditionals can be represented in DAGs and so reduced to trees in exactly the same way as simple equations. An example is shown in Figure 8: this is a DAG representing the dataflow of the lines

if(x<2)
else
a = [...]

and shows a multiplexer node 186 and a "less than" operation node 186 in addition to the variable and integer nodes 181, 182, 183 and 184. As the skilled man will appreciate, it will generally be possible to use the [...] for source code which can be represented as a DAG.
[0034] The tree structure that is left [...] source code should be mapped [...] above is a particularly appropriate one for converting DAGs to trees, as it is straightforward to implement, is general in application, and through use of the critical path maintains the maximum "depth" of the computational engine to be synthesised (assuming each node represents a single computational element) because of the inclusion of paths with the maximum number of nodes. As the person skilled in the art will appreciate, alternative approaches to determining which edges are to be removed in converting the DAGs into trees can be adopted. One alternative embodiment of the DAG to tree reduction process is to assign a timing-based weight to every node (based, for example, on the length of time required to execute the corresponding computational element) and then to compare the accumulated weights of each path, selecting a path to define the tree accordingly on the basis of, for example, greatest accumulated weight. This approach may be more appropriate if the timing parameters of the secondary processor 2 will be a critical practical factor and in particular if the timing dependencies are not mainly related to the node count (which may be the case in structures where, for example, multiplication is several times more time consuming than addition).
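
The difference between the node-count criterion and the timing-weighted alternative can be illustrated with a short Python sketch. The operation times in `OP_TIME`, the node names and the helper names are all hypothetical; the text states only that multiplication may be several times more time consuming than addition.

```python
# Hypothetical per-operation execution times in arbitrary units (assumption).
OP_TIME = {"input": 0, "output": 0, "add": 1, "sub": 1, "mul": 4}


def path_weight(path, op_of):
    """Accumulated timing weight of the nodes along one leaf-to-root path."""
    return sum(OP_TIME[op_of[node]] for node in path)


def pick_by_node_count(paths):
    """First embodiment: the path touching the greatest number of nodes."""
    return max(paths, key=len)


def pick_by_timing(paths, op_of):
    """Alternative embodiment: the path with the greatest accumulated weight."""
    return max(paths, key=lambda p: path_weight(p, op_of))


# Two illustrative paths from the same input to the same output.
op_of = {"a": "input", "n1": "add", "n2": "add", "n3": "add",
         "m1": "mul", "out": "output"}
long_path = ["a", "n1", "n2", "n3", "out"]   # three additions, weight 3
short_path = ["a", "m1", "out"]              # one multiplication, weight 4

print(pick_by_node_count([long_path, short_path]))     # the three-add path
print(pick_by_timing([long_path, short_path], op_of))  # the multiply path
```

With these assumed weights the two criteria disagree, which is the situation in which the text suggests the timing-based weight may be the better choice.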
[0035] The next step in the compilation process, as can be seen from Figure 2, takes trees as inputs and determines the selection of source code for the secondary processor 2. As is further illustrated in Figure 2, this step of the process comprises a series of sub-steps. The first of these is the analysis and classification of the trees resulting from the candidate dataflows. This is a [...]

[0036] The objective in this stage of the compilation process is to determine as best possible which of the candidate dataflows from the source code would be the best choices for execution by the secondary processor. This is to a large degree dependent on the nature of the hardware in the secondary processor. An extremely efficient mapping of source code to the secondary processor 2 can be made where dataflows are sufficiently similar that broadly the same hardware representation can be used for each dataflow. It therefore follows that good choices of candidate dataflows for mapping to the secondary processor can be made by finding sets of dataflows that are sufficiently similar to each other. This is what is achieved by analysing and classifying the trees resulting from the candidate dataflows.
[0037] A powerful technique for matching trees, used in this embodiment of the invention, is the tree matching algorithm devised by Kaizhong Zhang of the University of West Ontario, Canada.

[0038] This algorithm is described in Kaizhong Zhang, "A Constrained Edit Distance Between Unordered Labelled Trees", Algorithmica (1996) 15:205-222, Springer Verlag, and is provided as a toolkit by the University of West Ontario, the toolkit being at the time of writing obtainable over the internet from ftp://ftp.csd.uwo.ca/[...] It will be appreciated that alternative approaches of matching trees to determine a degree of similarity therebetween are available to the skilled man. The approach to tree matching used in this embodiment of the invention is described below.

[0039] The principle of operation of Zhang's algorithm is the following: two trees are compared node-by-node through a dynamic programming technique that minimises the edit operations required to transform one tree into another. This cost of transformation is termed here an edit cost. The edit costs of successively larger subtrees are cross-compared, with a record being kept of the minimum costs found. The computational structure can be characterised as that of a recursive dynamic program which uses a working dynamic programming grid to calculate component subtree distances and records the result on the main grid.
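
The idea of a recursive dynamic program over subtrees can be illustrated with a much simpler relative of Zhang's algorithm. The Python sketch below computes the classical edit distance between ordered labelled trees by memoised recursion over forests; it is not the constrained unordered-tree algorithm cited above, and the unit costs, tree encoding and function names are assumptions made for illustration.

```python
from functools import lru_cache

# Assumed unit costs; a real system would derive these from the hardware.
INSERT_COST = 1
DELETE_COST = 1


def substitution_cost(label_a, label_b):
    return 0 if label_a == label_b else 1


def size(forest):
    """Number of nodes in a forest of (label, children) tuples."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum edit cost between two ordered forests (memoised recursion)."""
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2) * INSERT_COST
    if not f2:
        return size(f1) * DELETE_COST
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # Delete the rightmost root of f1; its children take its place.
        forest_distance(f1[:-1] + kids1, f2) + DELETE_COST,
        # Insert the rightmost root of f2.
        forest_distance(f1, f2[:-1] + kids2) + INSERT_COST,
        # Match the two rightmost roots and recurse on the remaining parts.
        forest_distance(f1[:-1], f2[:-1])
        + forest_distance(kids1, kids2)
        + substitution_cost(label1, label2),
    )


def tree_distance(t1, t2):
    return forest_distance((t1,), (t2,))


# Illustrative trees: a node is (label, tuple_of_children).
a, b, c = ("a", ()), ("b", ()), ("c", ())
t_five = ("add", (("mul", (a, b)), c))
t_six = ("add", (("shift", (("mul", (a, b)),)), c))   # one extra node
t_swap = ("add", (("div", (a, b)), c))                # one label changed

print(tree_distance(t_five, t_six))   # 1: a single insertion
print(tree_distance(t_five, t_swap))  # 1: a single substitution
```

The memoisation table plays the role of the grid of subtree distances mentioned in the text: each pair of sub-forests is evaluated once and the minimum recorded.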

[0040] The edit operations available are insertion, deletion and substitution. These are shown in Figures 4a and 4b. Figure 4a shows two trees: tree 151 with five nodes and tree 152 with six nodes. The structure of the trees can be made identical by addition of a node between nodes 3 and 5 of tree 151: this [...] Consequently, transformation of tree 151 to tree 152 is achieved by insertion of this node, and transformation of tree 152 to tree 151 is achieved by deletion of it. (On the CHESS architecture described in Appendix A, "deletion" is represented in hardware by "bypass" of a unit of the array: this is an example of an architecturally designed cost - in this case, an extremely low cost.) For Figure 4b, the two trees 151 and 152 have the same structure, but the two nodes 4 represent a different type of operation in each tree: it is therefore necessary to substitute for node 4 in transforming one tree to the other. Every node therefore needs a "label": a tag attached to the node which identifies the type of node among the various types of node possible.

[0041] As previously indicated, each of these edit operations has a cost. This en[...] for example, the same result may be achieved in some architectures either by an insertion and a deletion, or by a substitution: the costs of these different alternatives can be compared.

[0042] The result of the comparison of two trees by this algorithm is the production of [...], where t1 belongs to the first tree and t2 belongs to the second tree. Each [...] points in the two trees, suggesting the mapping of t1 and t2 on to each other. The list of pairs effectively defines the skeleton of a tree which can contain either of the compared trees: in this skeleton, to transform the first tree into the second tree, each node t1 has to be substituted with the respective [...] either inserted or deleted depending on which tree they belong to, as is discussed further below. For this list of pairs there will be defined an edit distance: this is the minimum in edit costs cumulated over the pairs necessary to transform one tree to the other. The algorithm is devised to determine an edit distance between two trees, together with the set of transformations which achieves that edit distance: alternative transformations will be possible, but they will have a higher associated cumulative edit cost.
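
How a list of pairs translates into edit operations can be sketched as follows. This is an illustrative Python sketch only: the node identifiers, labels and the function name `edit_script` are hypothetical, and it assumes (as the surviving text suggests) that paired nodes are substituted while unpaired nodes are deleted or inserted according to the tree they belong to.

```python
def edit_script(labels1, labels2, pairs):
    """Derive edit operations from a mapping between two labelled trees.

    labels1 / labels2 map node ids to operation labels for each tree;
    pairs is a list of (t1, t2) node-id pairs produced by tree matching.
    """
    mapped1 = {t1 for t1, _ in pairs}
    mapped2 = {t2 for _, t2 in pairs}
    # Paired nodes with different labels need a substitution.
    script = [("substitute", t1, t2)
              for t1, t2 in pairs if labels1[t1] != labels2[t2]]
    # Unpaired nodes of the first tree are deleted ...
    script += [("delete", t1) for t1 in labels1 if t1 not in mapped1]
    # ... and unpaired nodes of the second tree are inserted.
    script += [("insert", t2) for t2 in labels2 if t2 not in mapped2]
    return script


# Hypothetical example: node ids and labels are made up for illustration.
first = {1: "add", 2: "mul", 3: "in"}
second = {1: "add", 2: "div", 3: "in", 4: "shift"}
pairs = [(1, 1), (2, 2), (3, 3)]

print(edit_script(first, second, pairs))
# [('substitute', 2, 2), ('insert', 4)]
```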

[0043] The value of computing an edit distance based on edit costs is that the edit costs may be chosen to represent
the "hardware cost" in reconfiguring the secondary processor from the configuration representing one tree to a
configuration representing the other tree in a mapping. This "hardware cost" is typically a measure of the quantity of
secondary processor resources that will be taken up to achieve the second configuration given the existence of the
first - this can be considered, for example, in terms of the additional area of device used. These costs will be
determined by the nature of the secondary processor hardware, as for different types of hardware the physical
realisation of insertion, deletion and substitution operations will be different. For the reconfigurable CHESS array
discussed in Appendix A, a "bypass" operation involves minimal cost, a substitution between an adds and subs (addition
and subtraction operations) has low cost, whereas substitution between muls and divs (multiplication and division
operations) is expensive.
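An architecture-dependent cost model of this kind might be expressed as a small table of substitution costs. The sketch below only reproduces the ordering described above (bypass cheapest, add/sub substitution cheap, mul/div substitution expensive); the numeric values, the default cost and the function names are hypothetical and would in practice be derived from the secondary processor hardware.

```python
# Hypothetical cost values: only their relative ordering reflects the text.
BYPASS_COST = 1       # inserting or deleting an operation via a bypass
LOW_COST = 2          # substitution between closely related operations
HIGH_COST = 20        # substitution between expensive operations
DEFAULT_COST = 10     # assumed cost for any other substitution

CHEAP_PAIRS = {frozenset({"add", "sub"})}
EXPENSIVE_PAIRS = {frozenset({"mul", "div"})}


def substitution_cost(op_a, op_b):
    """Assumed hardware cost of reconfiguring a node from op_a to op_b."""
    if op_a == op_b:
        return 0
    pair = frozenset({op_a, op_b})
    if pair in CHEAP_PAIRS:
        return LOW_COST
    if pair in EXPENSIVE_PAIRS:
        return HIGH_COST
    return DEFAULT_COST


def insertion_cost(op):
    """Assumed cost of inserting an operation (realised as a bypass)."""
    return BYPASS_COST


def deletion_cost(op):
    """Assumed cost of deleting an operation (realised as a bypass)."""
    return BYPASS_COST
```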

[0044] As indicated above, an edit distance between two trees can be constructed. However, a further step can be
taken: using Zhang's algorithm, a [...] each one of a set of trees. This taxonomy can readily be provided in the form
of a tree, of which an example is shown in Figure 5. Each leaf node 161 of the tree represents a candidate tree
extracted from a DAG, and each intermediate node 162 represents an edit cost. The tree provides a unique path between
each pair of leaf nodes. The edit distance between the two leaf nodes of a pair is found by summation of costs
provided at each intermediate node on this path. For example, the edit distance between any pair of the leaf nodes
representing Tree#4, Tree#5 or Tree#6 is 6. However, the edit distance between Tree#1 and Tree#4 is 496: the summation
of intermediate nodes with values of 12, 221, 107, 50 and 6.
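The summation of costs along the unique path between two leaves of such a taxonomy can be sketched as below. The taxonomy shown is a small hypothetical one, not that of Figure 5, and the representation (a parent map plus a cost per intermediate node) and the function names are assumptions made for illustration.

```python
def ancestors(node, parent):
    """Intermediate nodes above `node`, ordered from nearest to the root."""
    chain = []
    node = parent[node]
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def taxonomy_distance(leaf_a, leaf_b, parent, cost):
    """Sum the costs of the intermediate nodes on the path between two leaves."""
    up_a = ancestors(leaf_a, parent)
    up_b = ancestors(leaf_b, parent)
    # The first ancestor of leaf_a that is also an ancestor of leaf_b is
    # where the two upward paths meet.
    common = next(n for n in up_a if n in up_b)
    path = up_a[:up_a.index(common) + 1] + up_b[:up_b.index(common)]
    return sum(cost[n] for n in path)


# Hypothetical taxonomy: four candidate trees under two intermediate nodes.
parent = {"T1": "n1", "T2": "n1", "T3": "n2", "T4": "n2",
          "n1": "root", "n2": "root", "root": None}
cost = {"n1": 5, "n2": 6, "root": 40}

near = taxonomy_distance("T3", "T4", parent, cost)   # path: n2
far = taxonomy_distance("T1", "T3", parent, cost)    # path: n1, root, n2
```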

[0045] This taxonomy is indicative of the number of edit operations required to translate between trees. Such a
taxonomy is a valuable tool, as it can be used heuristically [...] The creation of a taxonomy thus renders it easy to
determine which trees are sufficiently similar to be consolidated together (as will be discussed below), and which
are too diverse for this purpose. This can be done by imposition of an edit distance threshold. A group of trees can
be selected for consolidation if the edit distance between each and every possible pair of trees in the group is less
than the edit distance threshold. The value of the edit distance threshold is arbitrary, and can be [...] in order to
optimise the performance of the system.
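The pairwise threshold criterion can be illustrated with a short sketch. The test is_consolidable follows the criterion stated above directly; the greedy grouping strategy built on it is only one possible way of forming groups and is an assumption of this example, as are the tree names and distances.

```python
from itertools import combinations


def is_consolidable(group, dist, threshold):
    """True if every pair of trees in the group is closer than the threshold."""
    return all(dist[frozenset(pair)] < threshold
               for pair in combinations(group, 2))


def greedy_groups(trees, dist, threshold):
    """Assumed strategy: add each tree to the first group it is compatible with."""
    groups = []
    for tree in trees:
        for group in groups:
            if is_consolidable(group + [tree], dist, threshold):
                group.append(tree)
                break
        else:
            groups.append([tree])
    return groups


# Hypothetical pairwise edit distances between three candidate trees.
dist = {
    frozenset({"T1", "T2"}): 3,
    frozenset({"T1", "T3"}): 50,
    frozenset({"T2", "T3"}): 48,
}
classes = greedy_groups(["T1", "T2", "T3"], dist, threshold=10)
```

With a threshold of 10, the first two trees form one class and the third is left on its own.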

[0046] The advantage of consolidating a group of trees is that a common hardware configuration can be used for the
whole group and will support the function of each tree. This is particularly appropriate for architectures, such as
CHESS, in which low-latency partial reconfiguration mechanisms are available on the secondary processor.
Reconfiguration is required to change the configuration from that to support the function of one tree to that to
support the function of another tree: however, as the edit distance between these trees will never be greater than the
edit cost threshold, the degree of reconfiguration required is already known to be within acceptable bounds. The group
of trees are consolidated together by construction of a "supertree" which contains a representation of every component
tree. After it has been constructed, the supertree can be converted into a representation of each of the relevant DAGs
extracted from the source code by reinsertion of the previously removed minor links. The hardware configuration may
then be determined from the full supertree. The construction of the supertree is discussed in detail below.

[0047] Figure 6 illustrates the step of construction of a supertree from a group of trees which fell below the
specified edit cost threshold: such a group of trees is here termed a class. The trees 171, 172 and 173 can all be
mapped together into supertree 170. The reconfiguration required to change the hardware configuration from that to
support, for example, tree 171 to that of tree 172 is sufficiently limited to be realizable in practice, because the
edit distance between the two trees is below the edit cost threshold.

[0048] An exemplary supertree assembly algorithm, merge, is provided as C code in Appendix B. The function of the
algorithm is described below, with reference to Figure 9. The algorithm contains the following elements:
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merge: 

[0049] The tree in the class with the largest number of nodes is chosen to be the initial merge tree - if there are
trees with an equal number of nodes, an arbitrary selection can be made. The remaining trees are termed source trees.
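The choice of initial merge tree can be sketched as follows. The tree representation (a label with a list of child trees) and the function names are assumptions for illustration; ties are resolved here by taking the first tree encountered, which is one arbitrary selection among those permitted.

```python
def count_nodes(tree):
    """Number of nodes in a tree given as (label, [child trees])."""
    _label, children = tree
    return 1 + sum(count_nodes(child) for child in children)


def split_class(trees):
    """Return (initial merge tree, source trees) for a class of trees."""
    merge_tree = max(trees, key=count_nodes)   # first maximal tree wins ties
    source_trees = [t for t in trees if t is not merge_tree]
    return merge_tree, source_trees


# Hypothetical class of two dataflow trees.
small = ("add", [("x", []), ("y", [])])
large = ("mul", [("add", [("x", []), ("y", [])]), ("z", [])])
merge_tree, source_trees = split_class([small, large])
```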
[0050] For each source tree the following operations are then applied:

From the mapping between the source tree and the merge tree which has been calculated (in this embodiment,
from Zhang's algorithm and edit costs determined from the secondary processor architecture), the supertree is
constructed as follows:

1. Firstly, mapped nodes closest to the root are considered;

2. The source tree operation (source operation) is concatenated to the corresponding mapped merge tree
operation (merge operation);

3. For each child operation of the source operation:

a. If the child is mapped, revert to step 2 with respect to the source child.

b. If the child is not mapped, then consider whether there is any mapping in the subtree of which the child
is the root (source subtree):

i. If there is no further mapping, simply adopt the source subtree for merging into the merge tree under
the corresponding merge tree node.

ii. If there is a further mapping inside the source subtree, connect the subtree as follows:

a. If the merge operation of this subordinate mapping falls outside the previously mapped subtree,
remove the mapped source operation from the source tree. There is recursion present at this
stage - where mapped children have already been dealt with, all that needs to be done is to
remove what would otherwise be a cross tree link.

b. This is shown in Figure 9. If the merge operation of this subordinate mapping does fall within
the previously mapped subtree, climb up the merge tree until the least common ancestor for all
contained subordinate mappings is found. The least common ancestor is the first node to contain
all of the source mappings. The unmapped source segment is then mapped into the merge tree
by linking the source operation of the unmapped source subtree as a child of the least common
ancestor's parent and by linking the least common ancestor as the child of the unmapped source
operation just above the closest mapped source operation in the current subtree (where the "closest
mapped source operation" delimits the lower end of an unmapped segment of the source tree,
and is a mapped node which falls within the subtree of the current mapping - the source node's
parent, which is unmapped, adopts the merge tree's least common ancestor as a child and vice
versa).

■ The pair of intermingled trees are normalised into a single tree, which forms the new merge tree.
The procedure continues until all the source trees in the class are contained within the supertree.

[0051] This process is indicated in Figure 9. Figure 9a shows two dataflow trees, a merge tree 201 and a source tree
202. There are three mappings made between nodes by the comparison algorithm - the remaining nodes need to be
inserted appropriately. As indicated in section 1 above, the first step is to consider the mapped operations nearest
the root - in this case, at the root. These operations A are concatenated.

[0052] After this, the child nodes of A in the source tree are considered. Node B does not have a mapping and is not
an ancestor to any mappings - it is therefore merged as a child of A* (see Figure 9b). The other child node of A, C,
does however have descendant mappings (D and F which map to D and E in the merge tree). Both the relevant merge
operations fall in the previously mapped subtree (as they are both descendants of A). It is therefore necessary to
follow the course set out in section 3(b)(ii)(b) above. The least common ancestor containing both mapped merge
operations D and E is X. C of the source tree is thus linked into the merge tree as child of A* (the parent of X) and
parent of X. This arrangement is shown in Figure 9b - the merging is completed by concatenation or merging of the
remaining nodes of the source tree, all of which steps are straightforward.
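The linking step of this example can be sketched as below: the least common ancestor of the mapped merge operations is found by climbing the merge tree, and the unmapped source operation is then linked in between that ancestor and its parent. The parent-map representation and the function names are assumptions for illustration and are not taken from the merge code of the appendix; only the A, X, D, E and C relationships come from the example above.

```python
def path_to_root(node, parent):
    """The node followed by its ancestors, ending at the root."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def least_common_ancestor(nodes, parent):
    """First node, climbing from nodes[0], that contains every given node."""
    first, *rest = nodes
    other_paths = [set(path_to_root(n, parent)) for n in rest]
    for candidate in path_to_root(first, parent):
        if all(candidate in path for path in other_paths):
            return candidate
    raise ValueError("nodes do not share a common ancestor")


def link_above(new_node, ancestor, parent):
    """Make new_node a child of ancestor's parent and the parent of ancestor."""
    parent[new_node] = parent[ancestor]
    parent[ancestor] = new_node


# Merge tree fragment from the example: A is the parent of X, and X
# contains the mapped merge operations D and E.
parent = {"A": None, "X": "A", "D": "X", "E": "X"}
lca = least_common_ancestor(["D", "E"], parent)
link_above("C", lca, parent)   # C becomes child of A and parent of X
```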

[0053] The resultant supertree 203 is shown in Figure 9c. This supertree 203 acts as merge tree for the merging in
of a further candidate source tree 204, as shown in Figure 9d. In this [...] a supertree node - merging is thus
entirely straightforward, and consists only of concatenation (ie substitution). This process continues until all the
candidate trees are merged into a supertree.
[0054] At this stage, it is possible to [...] secondary processor. The source code will contain DAGs other than those
which have been selected for inclusion in the supertree: for example, DAGs which have not been considered because they
do not lie at one of the most computationally intensive "hot spots" of the code. However, the code of these DAGs may
also run more quickly if executed on an appropriately adapted secondary processor rather than on the primary
processor. It can thus be advantageous to compare such remaining DAGs with the supertree by a backmapping process.
Processes derived from conventional backmapping techniques, such as Iburg, can be utilised for this purpose. However,
the most advantageous approach may be to return to use of Zhang's algorithm, and match further candidate trees in the
source code against the supertree, but this time with a lower edit cost threshold. Where the trees derived from such
DAGs [...] supertree, or where the edit cost for such a mapping falls below some minimum level, then the code of these
DAGs can also be allocated to the secondary processor and the supertree modified, if necessary. Control information
related to any such dataflows added by this backmapping process needs to be stored also.

[0055] From this supertree, it is then straightforward to insert the minor links which were removed from the DAGs on
their conversion into trees (including here any DAGs added from the backmapping process, if employed). The resulting
structure is a class dataflow, which represents all the information present in the DAGs of the class: control
information for the supertree (for example, to determine any reconfiguration that is to occur) must also be present.
This class dataflow can be used for the purpose of determining the hardware configuration of the secondary processor,
and can also be used to provide a structure for enabling stitching back into the source code appropriate calls to the
secondary processor: these steps are described further below.

[0056] Stitching calls to the secondary processor back into the source code in fact requires only the supertree, and
not the class dataflow, as the supertree prescribes the periphery of the dataflow. The actions required with respect to
any replaced dataflow in the source code are replacement of inputs of the dataflow (leaves of the tree reduced from
that dataflow) with load primitives and of the output of the dataflow (root of the relevant tree) with a read. The
leaves and roots of the relevant tree are contained in the supertree, so only the supertree is required for the
purpose. All remaining code subsumed in the dataflow can simply be removed, as it is replaced by the secondary
processor configuration.
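The replacement of a dataflow by load primitives and a read can be sketched as follows. The primitive names pfu_load and pfu_read, the register names and the tree representation are all hypothetical: the specification states only that leaves are replaced with load primitives and the root with a read.

```python
def leaves(tree):
    """Leaf labels of a tree given as (label, [child trees]), left to right."""
    label, children = tree
    if not children:
        return [label]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


def stitch_call(tree, input_regs, output_reg, result_var):
    """Emit the statements that replace the dataflow in the source code.

    Each leaf (input) becomes a load into its allocated I/O resource and
    the root (output) becomes a read; the interior code is dropped.
    """
    lines = [f"pfu_load({reg}, {var});"
             for var, reg in zip(leaves(tree), input_regs)]
    lines.append(f"{result_var} = pfu_read({output_reg});")
    return lines


# Hypothetical dataflow y = (a * b) + c with assumed register allocation.
dataflow = ("add", [("mul", [("a", []), ("b", [])]), ("c", [])])
replacement = stitch_call(dataflow, ["R0", "R1", "R2"], "R3", "y")
```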
[0057] Figure 7 shows a logical interface for achieving the necessary substitutions into the source code. An input
tree, labelled Input Tree #3, is shown, together with a supertree, labelled PFU Tree. Each node in Input Tree #3 has
its own unique operation ID obtained from the compiler internal form representation. For the supertree (PFU Tree),
registers or other I/O resources are allocated to the leaves and the root. The implicit mapping between Input Tree #3
and PFU Tree thus provides a correspondence between operation IDs of the Input Tree nodes and the I/O resources
allocated for PFU Tree in the form of a specification [...] removal of the code subsumed by the PFU and the
substitution of the necessary I/O primitives in the code.

[0058] From the class dataflow, it is possible to configure the secondary processor. This step can be conducted
according to known approaches, by reduction of the class dataflow to a netlist (with insert, delete and substitute
operations, and including in appropriate form any dynamic reconfiguration instructions), and then mapping the netlist
to the specific secondary processor hardware, taking into account requirements of reconfiguration between component
dataflows. For conventional FPGA architectures, these steps can be carried out essentially by use of appropriate known
tools. For example, in the case of a standard Xilinx FPGA such as the XC4013, then appropriate Xilinx proprietary
tools can be used. Firstly, the netlist can be rendered in Xilinx netlist format (XNF). This can then be followed by
partitioning into configurable logic blocks and input/output blocks by the Xilinx Partition, Place and Route program
(PPR), with the resultant being converted to a configuration bitstream by the Xilinx MakeBits program. This approach
is discussed, together with further discussion of provision of predetermined reconfiguration solutions, in "Run-Time
Programming Method for Reconfigurable Computer" by Steve Casselman, currently available on the World Wide Web at
http://www.reconfig.c[...], a contribution to the World Wide Web roundtable on reconfigurable computing operated by
SB Associates, Inc. of 504 Nino Avenue, Los Gatos, CA 95032, USA. Essentially similar procedures can be followed for
alternative types of configurable and reconfigurable processor, such as the CHESS device described in Appendix A,
using tools appropriate to the processor concerned.
[0059] Once the source code is generated in executable form with appropriate calls to the secondary processor, and
once the secondary processor configuration has been determined, the source code can be loaded and executed. The
source code is executed in the primary processor with [...] secondary processor is specifically adapted to process
the dataflows [...] significantly increased. For example, a 25% improvement was found [...] invention to the iDCT
algorithm from the JPEG [...] secondary processor because of I/O constraints.

[0060] The methods here described are thus particularly effective to allow for [...] in an architecture comprising a
primary processor and a reconfigurable secondary processor.
APPENDIX A
CHESS array

The CHESS array is a variety of field programmable array in which the programmable
elements are not gates, as in an FPGA, but 4-bit arithmetic logic units (ALUs). The array
configuration is described in [...]
ALU structure and provision of instruction to ALUs is discussed in a copending application
entitled "Reconfigurable Processor Devices", and filed on the same date as the present
application.

The CHESS array consists of a chessboard layout with alternating squares, comprising an ALU
and a switchbox structure respectively. The configuration memory for an adjacent switchbox
is held in the ALU. Individual ALUs may be used in a processing pipeline, and in a preferred
implementation, provision is made to allow dynamic provision of instructions from one ALU
to determine the function of a succeeding ALU. ALUs are 4-bit, with four identical bitslices,
with 4-bit inputs A and B taken directly from an extensive 4-bit interconnect wiring network,
and 4-bit output U provided to the wiring network through an optionally latchable output
register: 1-bit carry input and output are also provided and have their own interconnect.

Dynamic instructions are providable from the output U of one ALU to a 4-bit instruction input
I of another ALU. The carry output of one ALU can also be used as Cin of another ALU
with the effect of changing the instruction of that ALU.

The CHESS ALU is adapted to support multiplexing between A and B inputs, and also
supports multiplexing between related instructions (eg OR/NOR, AND/NAND).
Reconfiguration between such instructions can be achieved through appropriate use of the
carry inputs and outputs without consumption of silicon. More complex reconfigurations (eg
AND/XOR, Add/Sub) can be achieved through using two ALUs, the first to multiplex between
the two alternative instructions and the second to execute the chosen instruction on the
operands. Multiplication will take up more than a single ALU, making reconfiguration
involving a multiplication operation more complex. It is straightforward using the multiplexer
capacity of a CHESS ALU to "bypass" an operation, with appropriate control resulting in
either performance of the operation or propagation of a given input.

A sample set of functions obtainable from the instruction inputs is indicated in Table A1
below: a wide range of possibilities are available with appropriate logic in connection of the
instruction inputs to the ALU. The functions are described in Table A2.
Table A1: Instruction bits and corresponding functions






EP 0 926 594 A1 





Name           U function                 Cout function
ADD            A plus B                   Arithmetic carry
SUBA           A minus B                  Arithmetic carry
A AND B        Ui = Ai AND Bi             [...]
[...]          Ui = Ai OR Bi              [...]
[...]          Ui = NOT (Ai OR Bi)        [...]
[...]          [...]                      Cout = Cin
[...]          Ui = NOT (Ai XOR Bi)       [...]
A AND NOT B    Ui = Ai AND (NOT Bi)       Cout = Cin
B AND X        [...]                      [...]
[...]          Not applicable             if A == B then 0, else 1
MATCH1         Not applicable             bitwise AND of A and B, followed by OR across width of the word
MATCH0         Not applicable             bitwise OR of A and B, followed by an AND across the width of the word

Table A2: Outputs for instructions 
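The ADD and MATCH entries of the table can be illustrated with a small software model of a 4-bit ALU. This is a sketch for illustration only and is not a description of the CHESS hardware: the function names are hypothetical, and the inclusion of a carry input in the addition is an assumption based on the 1-bit carry input and output described above.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1   # 0b1111 for a 4-bit word


def alu_add(a, b, carry_in=0):
    """Return (U, Cout) for a 4-bit addition with an assumed carry input."""
    total = (a & MASK) + (b & MASK) + carry_in
    return total & MASK, total >> WIDTH


def match1(a, b):
    """Cout for MATCH1: bitwise AND of A and B, then OR across the word."""
    return 1 if (a & b & MASK) != 0 else 0


def match0(a, b):
    """Cout for MATCH0: bitwise OR of A and B, then AND across the word."""
    return 1 if ((a | b) & MASK) == MASK else 0


sum_bits, carry_out = alu_add(0b1111, 0b0001)   # wraps to 0 with carry 1
```

Under this model MATCH1 gives 1 when at least one bit position holds a 1 in both operands, and MATCH0 gives 0 when at least one bit position holds a 0 in both operands.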



2's complement arithmetic is used, and the arithmetic carry is provided to be consistent with
this arithmetic. The MATCH functions are so-called because for MATCH1 the value of 1 is 
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Claims 

1. A method of compiling source code [...], comprising:

selective extraction of dataflows from the source code;

transformation of the extracted dataflows into trees;

matching of the trees against each other to determine minimum edit cost relationships for transformation of one
tree into another;

determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost relationships,
and creating for each group a generic dataflow capable of supporting each dataflow in that group;

using the generic dataflow or dataflows to determine the hardware configuration of the secondary processor;
and

substituting into the source code calls to the secondary processor for said group or plurality of groups of
dataflows, and compiling the resultant source code to the primary processor.

2. A method as claimed in claim 1, [...] minimum edit distances for classification of the trees.

3. A method as claimed in claim 1 or claim 2, wherein said minimum edit cost relationships are determined according
to the architecture of the secondary processor, and represent a hardware cost of a corresponding reconfiguration
of the secondary processor.

4. A method as claimed in any of claims 1 to 3, wherein the hardware configuration of the secondary processor allows
for reconfiguration of the secondary processor during execution.

5. A method as claimed in claim 4, wherein the secondary processor is an application specific instruction processor.

6. A method as claimed in claim 4, wherein the secondary processor is a field programmable gate array.

7. A method as claimed in claim [...]

8. A method as claimed in any of claims 4 to 7, wherein reconfiguration of the secondary processor is required during
execution of the source code to support each dataflow in the group.

9. A method as claimed in any preceding claim, wherein a generic dataflow of a group is calculated by an approximate
mapping of dataflows in the group on to each other, followed by a merge operation.

10. A method as claimed in any preceding claim, wherein the dataflows are provided as directed acyclical graphs and
are reduced to trees by removal of any links in the directed acyclical graph [...] a leaf node and the root of a
directed acyclical graph.

11. A method as claimed in claim 10, wherein the critical path is a path between two nodes which passes through the
largest number of intermediate nodes.

12. A method as claimed in claim 10, wherein [...] greatest accumulated execution time.

13. A method as claimed in any of claims [...] is compared with further dataflows extracted from [...] wherein those
of said further dataflows which match sufficiently closely the generic dataflow are added to the generic dataflow.

14. A method as claimed in any of claims [...] dependent on claim 9, wherein the removed links are stored after the
directed acyclical graphs are reduced to trees and are reinserted into the generic dataflow after the merging of the
trees of the group into the generic dataflow.
Figure 9a: Two Dataflow Trees

Figure 9b: Merged Composition

Figure 9c: New Supertree

Figure 9d: Next Candidate Mapping
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